Medical Data Integration Pipeline

ETL Demonstration: Messy CSV → Clean HL7 FHIR R4 Resources

Using Official German MII (Medizininformatik-Initiative) Test Data

  • Data Recovery: 87.7%
  • FHIR Resources: 896
  • Success Rate: 100%
  • Errors: 0

About This Project

Purpose: This demonstration project shows how messy hospital data can be transformed into FHIR-compliant resources with comprehensive quality assurance.

Approach: Uses official German MII (Medizininformatik-Initiative) test data to demonstrate understanding of the German medical informatics ecosystem and interoperability standards.

Project Overview

This project demonstrates a complete ETL pipeline that transforms messy hospital data into HL7 FHIR R4 compliant resources, built on official test data from the German Medizininformatik-Initiative (MII).

"Reverse Engineering" Approach for Validation

Started with clean MII FHIR → Created messy CSV with realistic quality issues → Transformed back to clean FHIR → Validated against original gold standard

  • Phase 1 (Extract & "Break"): FHIR → messy CSV
  • Phase 2 (Transform & Clean): CSV → clean FHIR
  • Phase 3 (Validate): compare to gold standard
  • Result (Quality Dashboard): metrics & visualizations
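Phase 3 boils down to a field-by-field comparison against the gold standard. A minimal sketch of that idea (the function name and sample records are illustrative, not the project's actual code):

```python
def field_recovery(original: dict, reconstructed: dict, fields: list[str]) -> float:
    """Fraction of gold-standard field values that survive the round trip."""
    present = [f for f in fields if original.get(f) is not None]
    if not present:
        return 1.0  # nothing to recover
    matched = sum(1 for f in present if reconstructed.get(f) == original[f])
    return matched / len(present)

# Gold-standard Patient vs. the version rebuilt from the messy CSV
gold = {"id": "00001", "gender": "male", "birthDate": "1950-01-01"}
rebuilt = {"id": "00001", "gender": "male"}  # birthDate was nulled in the CSV
recovery = field_recovery(gold, rebuilt, ["id", "gender", "birthDate"])
```

Aggregating such per-field scores over all patients yields a recovery rate like the 87.7% reported here.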

Data Quality Issues Handled

  • ID Formats: P-001, PAT001, 00001 → PAT-00001
  • Date Formats: DD.MM.YYYY, YYYY-MM-DD → ISO 8601
  • Gender Values: M, male, männlich → standardized
  • Missing Data: 10–15% nulls → handled gracefully
  • German Characters: ä, ö, ü, ß → UTF-8 encoding
  • Reference Integrity: invalid IDs → logged & flagged
Tech Stack: Python • pandas • fhir.resources (FHIR R4) • pytest • JavaScript/CSS/HTML
Data Source: Official MII Kerndatensatz Test Data (1,650 FHIR bundles)

Transformation Results

Success Rates by Resource Type

Resource Counts

  • Patients: 200 / 200
  • Conditions: 400 / 400
  • Medications: 115 unique
  • MedicationAdministrations: 181 / 181
  • Total FHIR Resources: 896

Transformation Viewer

See exactly how the pipeline transforms messy CSV input into clean FHIR R4 resources. Each example uses real data from the pipeline.

Messy CSV Input
  • PatientID: WP1-00001
  • FirstName: (empty)
  • LastName: Mustermann
  • Birthdate: 01.01.1950
  • Gender: Male
  • Street: (empty)
  • City: Bonn
  • PostalCode: (empty)
Clean FHIR R4 Output
  • id: 00001
  • given: (omitted)
  • family: Mustermann
  • birthDate: 1950-01-01
  • gender: male
  • line: (omitted)
  • city: Bonn
  • postalCode: (omitted)
Transformations applied:
  • ID normalized: WP1-00001 → 00001 (stripped prefix)
  • Date converted: 01.01.1950 (DD.MM.YYYY) → 1950-01-01 (ISO 8601)
  • Gender standardized: Male → male (FHIR ValueSet)
  • Missing fields (FirstName, Street, PostalCode) gracefully omitted from FHIR output
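The "graceful omission" shown above can be sketched with plain dicts. The real pipeline uses fhir.resources / pydantic models; `build_patient` and the column names are illustrative, and input values are assumed already normalized:

```python
def build_patient(row: dict) -> dict:
    """Assemble a FHIR R4 Patient as JSON, omitting empty fields rather than emitting nulls."""
    patient = {"resourceType": "Patient", "id": row["PatientID"]}
    for csv_col, fhir_field in (("Gender", "gender"), ("Birthdate", "birthDate")):
        if row.get(csv_col):
            patient[fhir_field] = row[csv_col]
    name = {}
    if row.get("LastName"):
        name["family"] = row["LastName"]
    if row.get("FirstName"):
        name["given"] = [row["FirstName"]]  # FHIR allows multiple given names
    if name:
        patient["name"] = [name]
    address = {}
    if row.get("Street"):
        address["line"] = [row["Street"]]
    for csv_col, fhir_field in (("City", "city"), ("PostalCode", "postalCode")):
        if row.get(csv_col):
            address[fhir_field] = row[csv_col]
    if address:
        patient["address"] = [address]
    return patient
```

For the patient above, this yields a name without `given` and an address containing only `city`, matching the output shown.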
Messy CSV Input
  • PatientID: 00004
  • FirstName: (empty)
  • LastName: Mustermann
  • Birthdate: 1953/01/01
  • Gender: W
  • Street: Musterstraße 1
  • City: Bonn
  • PostalCode: 53121
Clean FHIR R4 Output
  • id: 00004
  • given: (omitted)
  • family: Mustermann
  • birthDate: 1953-01-01
  • gender: female
  • line: Musterstraße 1
  • city: Bonn
  • postalCode: 53121
Transformations applied:
  • Date converted: 1953/01/01 (YYYY/MM/DD) → 1953-01-01 (ISO 8601)
  • Gender mapped: W (German abbreviation for weiblich) → female (FHIR ValueSet)
  • German character ß preserved correctly in UTF-8 address
Messy CSV Input
  • PatientID: 00006
  • FirstName: Maxi_03
  • LastName: (empty)
  • Birthdate: (empty)
  • Gender: weiblich
  • Street: (empty)
  • City: Bonn
  • PostalCode: 53121
Clean FHIR R4 Output
  • id: 00006
  • given: Maxi_03
  • family: (omitted)
  • birthDate: (omitted, no data)
  • gender: female
  • line: (omitted)
  • city: Bonn
  • postalCode: 53121
Transformations applied:
  • Gender mapped: weiblich (German for female) → female (FHIR ValueSet)
  • Missing LastName, Birthdate, Street gracefully omitted from FHIR resource
  • This patient demonstrates maximum missing data handling (4 of 8 fields empty)
Messy CSV Input
  • ConditionID: WP1-1
  • PatientID: PAT00002
  • Code: M80.01
  • CodeSystem: (empty)
  • Display: Osteroporose
  • RecordedDate: 2019/01/05
Clean FHIR R4 Output
  • resourceType: Condition
  • code.system: icd-10-gm
  • code.code: M80.01
  • code.display: Osteroporose
  • subject: Patient/00002
  • clinicalStatus: active
Transformations applied:
  • Missing CodeSystem inferred from ICD-10 code format → http://fhir.de/CodeSystem/bfarm/icd-10-gm
  • Patient reference normalized: PAT00002 → Patient/00002
  • Clinical status defaulted to active (not present in CSV)
  • Date converted: 2019/01/05 → ISO 8601 format
Messy CSV Input
  • MedicationID: PAT1
  • PatientID: WP1-00001
  • MedicationCode: N06AA09
  • MedicationName: ATC_AMITRYPTILIN
  • DoseValue: 1
  • DoseUnit: (empty)
  • Status: (empty)
  • EffectiveDate: 01-01-2019
Clean FHIR R4 Output
  • resourceType: MedicationAdministration
  • status: completed
  • medication: Medication/Medication-N06AA09
  • subject: Patient/00001
  • effectiveDateTime: 2019-01-01
Transformations applied:
  • Patient reference normalized: WP1-00001 → Patient/00001
  • ATC code N06AA09 linked to a Medication resource reference
  • Missing Status defaulted to completed
  • Date converted: 01-01-2019 (DD-MM-YYYY) → 2019-01-01 (ISO 8601)
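The defaulting and reference linking for this resource type can be sketched as follows; `build_medication_administration` and the column names are illustrative, and the date is assumed already converted to ISO 8601:

```python
import re

def build_medication_administration(row: dict) -> dict:
    """Assemble a minimal FHIR R4 MedicationAdministration with sensible defaults."""
    pid = re.search(r"(\d+)$", row["PatientID"]).group(1).zfill(5)
    return {
        "resourceType": "MedicationAdministration",
        # Empty Status defaults to "completed" (these are historical administrations)
        "status": row.get("Status") or "completed",
        "medicationReference": {"reference": f"Medication/Medication-{row['MedicationCode']}"},
        "subject": {"reference": f"Patient/{pid}"},
        "effectiveDateTime": row["EffectiveDate"],
    }
```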

Data Quality Metrics

Validation Against Original FHIR

Quality Scores

Gender
  • Correctly matched: 178 patients
  • Missing in source CSV (~11% randomly nulled): 22 patients
  • Input variations handled: M, m, Male, male, W, Female, weiblich, männlich
The gap is entirely due to deliberately introduced missing values (~11% null rate). All gender values that were present were correctly mapped to the FHIR ValueSet (male/female).
Names
  • Correctly matched: 173 patients
  • Missing FirstName in CSV: 23 patients (11.5%)
  • Missing LastName in CSV: 27 patients (13.5%)
Name mismatches are caused by deliberately nulled fields: when either the first or the last name is missing from the messy CSV, the reconstructed FHIR name cannot match the original. This is expected and by design.
Birthdate
  • Correctly matched: 135 patients
  • Missing in source CSV (~13% randomly nulled): 26 patients
  • Unparseable date formats: 26 patients
  • Ambiguous dates (DD-MM vs MM-DD): ~13 patients
  • Formats successfully parsed: DD.MM.YYYY, YYYY/MM/DD, YYYY-MM-DD
This is the lowest score, and intentionally so: dates are the hardest data quality challenge. Ambiguous formats like 01-05-2019 (January 5 or May 1?) cannot be resolved without context. This is a realistic real-world ETL problem.
Condition Codes
  • Valid ICD-10-GM codes: 366 conditions
  • Missing codes in source CSV (~8.5% nulled): 34 conditions
  • Missing CodeSystem (inferred from code): 47 conditions
When the CodeSystem column was empty but a valid ICD-10 code was present (e.g. M80.01), the pipeline inferred http://fhir.de/CodeSystem/bfarm/icd-10-gm from the code format.
Overall
  • Overall Data Recovery: 87.7% (EXCELLENT)
  • Quality Score: 100/100 (PASSED)
  • Transformation Errors: 0

Data Quality Issues Handled

Issue Type | Example Input (Messy) | Output (Clean FHIR) | Status
ID Format Variations | P-00001, PAT00002, 00003, WP1-00004 | 00001, 00002, 00003, 00004 | Normalized
Date Format Variations | 01.01.1950, 1950/01/01, 01-01-1950 | 1950-01-01 (ISO 8601) | Normalized
Gender Inconsistencies | M, m, male, männlich, W, weiblich | male, female (FHIR ValueSet) | Normalized
Missing Data | ~12% random missing values | Handled with defaults or omitted | Partial
German Characters | Müller, Schröder, Weiß | UTF-8 preserved | Preserved

Requirements Alignment

How this project and my broader experience map to core data engineering requirements.

Requirement
Demonstrated in This Project
Additional Experience
ETL/ELT Pipelines
Spark & NiFi
Project Complete ETL pipeline: FHIR → messy CSV → clean FHIR with validation. Production architecture section shows Spark + NiFi design.
Experience Large-scale ETL/ELT pipelines for ingesting, cleaning, enriching and indexing 413K+ texts into NoSQL databases and Elasticsearch.
SQL & Databases
Database know-how
Project HAPI FHIR Server (PostgreSQL) proposed in production architecture; data modeling for FHIR resources.
Experience Flask web application with SQL and NoSQL databases. Elasticsearch query design and index optimization for large-scale collections.
Medical Data Standards
FHIR, LOINC, OMOP
Project HL7 FHIR R4 resources, ICD-10-GM codes, ATC classification, MII Kerndatensatz profiles. OMOP CDM in production architecture.
Experience Analogous standards work: Dublin Core metadata, cross-source harmonisation, schema mapping, and semantic normalisation across heterogeneous datasets.
Programming
Python, SQL, R, Java
Project Python (pandas, fhir.resources, pydantic), pytest, JavaScript/HTML/CSS. ~2,500 lines of tested code.
Experience 2+ years production Python (Flask, NLP, ML). SQL for relational database work.
Distributed & Cloud
Kubernetes, Spark, Ansible
Project Production architecture proposes Spark cluster with horizontal scaling and fault tolerance.
Experience Deployment to Flemish Supercomputer Center (VSC) Tier-1 cloud: environment configuration, deployment coordination, operational constraints.
Quality Assurance
Quality assurance
Project Validation against gold standard, data quality drill-downs, 87.7% recovery rate with full transparency on gaps and causes.
Experience End-to-end data quality throughout research pipelines. Published digital humanities work requiring rigorous, auditable data provenance.
Documentation
Living documentation
Project This interactive dashboard is the living documentation: transformation viewer, quality drill-downs, architecture rationale.
Experience Extensive documentation practice across research publications. Structured, goal-oriented communication with both technical and non-technical stakeholders.
Healthcare Domain
Medical context
Project German medical informatics context: MII test data, German terminology, ICD-10-GM. Built from understanding of real DIC workflows.
Experience PhD in History of Medicine. Healthcare consulting (social determinants of health modeling). Python desktop application for a clinical researcher.

Project = demonstrated in this repository    Experience = from professional background

From Demo to Production Architecture

This demo uses a single-threaded Python pipeline. Below is how the same logic would be architected for production-scale medical data integration at a university hospital.

📝 Manual Execution: python3 src/pipeline.py
🐍 Python Script: sequential processing, 1 core, ~2 GB RAM
📄 CSV → FHIR: 200 patients, pandas + fhir.resources
💾 Local Filesystem: JSON files on disk

Suitable for demonstration, prototyping, and validation of transformation logic.

Proposed Architecture

🔄 Apache NiFi: orchestration & ingestion, data provenance tracking
Apache Spark: distributed transformation, 4–8 nodes, 200K+ patients
🏥 HAPI FHIR Server: production data store, REST API + search
📊 Analytics Layer: Parquet / data lake, research queries


Aspect | Demo Pipeline | Production Pipeline
Orchestration | Manual CLI execution | NiFi flow-based scheduling & monitoring
Processing | Single-threaded Python / pandas | Spark distributed across cluster nodes
Scale | 200 patients (~30 s) | 200K+ patients, horizontal scaling
Storage | Local JSON files | HAPI FHIR Server + Parquet data lake
Fault Tolerance | Script re-run on failure | Automatic retry, checkpointing, dead-letter queues
Monitoring | Console output + log files | NiFi dashboard, Spark UI, alerting
Research Export | JSON files for validation | OMOP CDM + FHIR Bulk Export for multi-site studies

Based on industry best practices for FHIR data integration at German university hospitals and the MII consortium architecture.

Technical Implementation

Technologies Used

  • Programming Language: Python 3.11
  • FHIR Library: fhir.resources 7.1.0
  • Data Processing: pandas, pydantic
  • Standards: FHIR R4, ICD-10-GM, ATC
  • Lines of Code: ~2,500

Key Features

  • Data quality validation (completeness, consistency)
  • Referential integrity checks
  • German medical terminology (ICD-10-GM, ATC)
  • MII Kerndatensatz profile compliance
  • Comprehensive error handling
  • Validation against source data
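The referential-integrity check can be sketched as follows; `flag_broken_references` and the data shapes are illustrative. Conditions whose subject does not resolve to a known Patient are flagged and logged rather than silently dropped:

```python
def flag_broken_references(patient_ids: set[str], conditions: list[dict]) -> list[dict]:
    """Return conditions whose subject reference does not resolve to a known Patient."""
    flagged = []
    for cond in conditions:
        ref = cond.get("subject", {}).get("reference", "")
        pid = ref.removeprefix("Patient/")
        if pid not in patient_ids:
            flagged.append(cond)  # logged & surfaced in the quality report
    return flagged
```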

How to Run This Project

Quick Start

# 1. Clone the repository
git clone https://codeberg.org/YOUR-USERNAME/YOUR-REPO
cd YOUR-REPO

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the extraction (FHIR → CSV)
python3 src/fhir_to_csv_extractor.py

# 4. Run the transformation pipeline (CSV → FHIR)
PYTHONPATH=. python3 src/pipeline.py

# 5. Validate against original FHIR
PYTHONPATH=. python3 src/validate_against_original.py

# 6. View this dashboard
open dashboard/index.html

Requirements: Python 3.9+, ~2 GB disk space for test data
Execution time: ~2–3 minutes for the full pipeline (200 patients)